2016-03-09

Brief background

  • Computational neuroscientist

  • More broadly involved in computational biology and bioinformatics.

  • Director of the Masters in Computational Biology

  • Involved in the http://bigdata.cam.ac.uk strategic initiative and in scoping workshops for the Alan Turing Institute.

What is data science?

"We have lots of data – now what?"

Data science is deep knowledge discovery through data inference and exploration. This discipline often involves using mathematic and algorithmic techniques to solve some of the most analytically complex business problems, leveraging troves of raw information to figure out hidden insight that lies beneath the surface. It centers around evidence-based analytical rigor and building robust decision capabilities.

Source: https://datajobs.com/what-is-data-science

Who is a data scientist

  • Someone who works primarily with data; detailed domain-specific knowledge is not necessarily required.

  • Computing/maths backgrounds predominant.

  • Application fields: wherever there are data!

What does a data scientist do?

  • 50-90% of time aggregating/pre-processing data (bioinformatics)

  • Working with domain experts to translate data into "feature vectors"

  • Developing new algorithms

  • data mining / prediction

  • interpreting

  • implicit assumptions: high volume, fully-automated

What are the new areas in data science?

Compared to, e.g., applied statistics 10 years ago, where are the new ideas coming from?

  • Advances in computing hardware

  • Software environments (R, knitr)

  • Open data and open computing

  • Ready access to diverse data sets

Computing hardware

  • Machines are getting faster

  • Easy ways to exploit 'embarrassingly parallel' problems on clusters

  • Graphics Processing Units (GPUs)

  • Connectivity

  • Cloud computing for large-scale applications

  • Custom solutions e.g. http://genestack.com to minimise data transfers
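A minimal sketch of an 'embarrassingly parallel' problem: each task depends only on its own input, so tasks can run in any order on any worker. The names `simulate` and `run_all` and the seeds are hypothetical. The standard-library thread pool stands in for a cluster scheduler; CPU-bound work would normally be sent to separate processes or nodes instead.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def simulate(seed):
    """Hypothetical independent task: mean of 1000 uniform draws."""
    rng = random.Random(seed)  # private generator, so tasks share no state
    return sum(rng.random() for _ in range(1000)) / 1000


def run_all(seeds, workers=4):
    """Run every task independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, seeds))


if __name__ == "__main__":
    print(run_all(range(8)))
```

Because no task reads another task's result, the parallel run gives the same answers as a plain serial loop over the seeds.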

Software environments

  • Gradual emergence of new languages beyond C and MATLAB: Python, R, Julia.

  • R is number 1 in computational biology thanks to CRAN and Bioconductor.

  • Importing ideas from software engineering for research, e.g. testing

  • Automated builds and Docker.
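As an illustration of importing testing into research code, here is a small sketch using Python's standard `unittest` module. The function `mean_firing_rate` is a hypothetical analysis helper, not taken from any package mentioned above.

```python
import unittest


def mean_firing_rate(spike_times, duration_s):
    """Spikes per second over a recording of the given duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(spike_times) / duration_s


class TestMeanFiringRate(unittest.TestCase):
    def test_known_value(self):
        # 4 spikes in 2 seconds is 2 spikes per second.
        self.assertEqual(mean_firing_rate([0.1, 0.5, 0.9, 1.7], 2.0), 2.0)

    def test_empty_train(self):
        self.assertEqual(mean_firing_rate([], 10.0), 0.0)

    def test_rejects_bad_duration(self):
        with self.assertRaises(ValueError):
            mean_firing_rate([0.1], 0.0)


if __name__ == "__main__":
    unittest.main()
```

Tests like these can be run on every automated build, so a change that breaks an analysis step is caught before results are regenerated.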

Reproducible research

Knitr documents (Eglen et al 2014)

"Basic features of recordings … We currently have 366 recordings in the repository, occupying 298 MB on disc"

Embedding code into documents

\begin{figure}[h]
  \centering
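  <<>>=
  library(RColorBrewer)            # provides brewer.pal()
  ## One colour per recording key (keys, counts, durns come from earlier chunks).
  keys.f <- as.factor(keys)
  n.labs <- nlevels(keys.f)
  lab.cols <- brewer.pal(n=n.labs, name='Set3')
  lab.cols[2] <- "#444444"         # override the second colour with dark grey
  ## Recording duration against number of spike trains, on log-log axes.
  plot(counts, durns/60, log='xy', pch=20, col=lab.cols[keys.f],
       xlab='Number of spike trains',
       xlim=c(2, 2000),
       ylab='Duration of recording (min)', bty='n', las=1)
  legend('topleft', legend=levels(keys.f), cex=0.8,
         ncol=2, col=lab.cols[1:n.labs], pch=19)
  @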
\caption{Basic features of recordings in the repository.  ...
We currently have \Sexpr{length(h5.files)} recordings in
the repository, occupying \Sexpr{file.size.mb} MB on disc.}
\end{figure}
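The inline `\Sexpr{}` expressions above are what keep the numbers in the text in step with the data. The same idea can be sketched outside knitr with Python's standard `string.Template`. The file names and sizes are made up for illustration, and `render_caption` is a hypothetical helper.

```python
from string import Template

# Hypothetical recording sizes in bytes, keyed by file name.
file_sizes = {"rec_a.h5": 1_500_000, "rec_b.h5": 750_000, "rec_c.h5": 2_250_000}

caption = Template(
    "We currently have $n recordings in the repository, "
    "occupying $mb MB on disc."
)


def render_caption(sizes):
    """Fill the caption from the data, as an inline Sexpr expression would."""
    total_mb = sum(sizes.values()) / 1e6
    return caption.substitute(n=len(sizes), mb=f"{total_mb:.1f}")


if __name__ == "__main__":
    print(render_caption(file_sizes))
```

Adding a recording to `file_sizes` changes the sentence the next time the document is built, with no hand-edited numbers to fall out of date.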

Full source code

Why bother?

Reviews

I would use an ordinate log scale for this bottom right panel (as
done in Fig.  3). But since the authors gave me everything, I can
do it! by redefining fourplot as follows:

Examples from computational biology

Breast cancer classification

Curtis C et al. (2012) The genomic and transcriptomic architecture
of 2,000 breast tumours reveals novel subgroups. Nature
486:346–352. 

How to move beyond morphological markers (and two key markers) used currently in clinics?

Hippocampal vs Cortical networks

Development of features

Classification

Given a recording, can we predict if it is CTX or HPC?

SVM Classification accuracy 85–95%.

ImageNet application

Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems 25 (Pereira F, Burges CJC, Bottou L, Weinberger KQ, eds), pp 1097–1105.

  • Top-1 (37.5%) and top-5 (17.0%) error rates, both lower than the previous state of the art.

  • Won the 2012 competition with a top-5 test error rate of 15%; the runner-up had an error rate of 26%.
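For reference, a top-k error rate counts an image as wrong only when its true label is missing from the k highest-scoring classes. A minimal sketch follows; the scores and labels are made up for illustration and are not ImageNet results.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the best k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Made-up scores for 4 images over 3 classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.3, 0.5],
]
labels = [0, 1, 0, 0]

if __name__ == "__main__":
    print(top_k_error(scores, labels, k=1))  # 0.75
    print(top_k_error(scores, labels, k=2))  # 0.25
```

Top-k error can only fall as k grows, which is why top-5 figures are always at or below top-1 figures for the same model.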

ImageNet architecture

Network split across 2 GPUs.

150K input values (224 x 224 pixels x 3 colour channels) feed an 8-layer network with layer sizes 253K-186K-65K-65K-43K-4K-4K-1K.

i.e. 650,000 neurons (excluding input), 60 million parameters.
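The sizes quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the rounded figures from the slide, not a reconstruction of the network.

```python
# Input size quoted on the slide: 224 x 224 pixels, 3 colour channels.
input_dim = 224 * 224 * 3

# Per-layer neuron counts as rounded on the slide (in thousands).
layers_k = [253, 186, 65, 65, 43, 4, 4, 1]
total_k = sum(layers_k)

if __name__ == "__main__":
    print(input_dim)               # 150528, i.e. the "150K" input
    print(len(layers_k), total_k)  # 8 layers, 621 (thousand neurons)
```

The rounded layer sizes sum to roughly 621K, the same order as the 650,000 neurons reported by Krizhevsky et al.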

(Today's winner, AlphaGo, uses a similar architecture.)

ImageNet performance

ImageNet close-matches

ImageNet receptive fields

Big data?

Big data

  • Data volumes from human genome sequencing are likely to match those of other domains, or of entities like Facebook and YouTube (Stephens et al 2015).
  • Given enough data you may always find something …

Spurious correlations
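A small simulation shows how chance alone produces strong correlations once enough variables are searched. All data here are random Gaussian noise, and the function names and sample sizes are illustrative assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def strongest_chance_correlation(n_vars, n_obs, seed=0):
    """Largest |r| between a random target and n_vars unrelated random variables."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for n_vars in (10, 100, 1000):
        print(n_vars, round(strongest_chance_correlation(n_vars, n_obs=20), 2))
```

With a fixed seed, the larger searches include the smaller ones, so the best correlation found can only grow as more unrelated variables are screened.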

Crosswords

More data or more complex algorithms?

Discuss!

Although some areas of biology are high-throughput, getting enough data is a problem in most areas.

Rise of the data paper.

Generate data and let others mine it?

Open science

Open data sets promote reuse and discovery

  • In the early 2000s, Nature and other journals required genomic datasets to be deposited in public repositories.

  • Rise of open access papers, leading to REF2020 requirements.

  • EPSRC "data availability" statement since May 2015.

  • Crowd sourcing approaches to problem solving.

Open competitions to drive research

  • Netflix competition in 2006 (https://en.wikipedia.org/wiki/Netflix_Prize)

  • Challenge: predict user ratings of movies given limited user data. A 10% improvement in prediction accuracy was required to win USD 1M.

  • Finished in 2009 after prize won by "coalition" of teams. Algorithm freely available as condition of entry.
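The Netflix Prize measured prediction accuracy as root-mean-square error (RMSE) on held-out ratings, so the 10% target was a 10% reduction in RMSE relative to Netflix's existing system. A minimal sketch of that calculation follows; the ratings are made up and the two prediction lists are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(actual))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to the baseline."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Made-up star ratings for five user-movie pairs (not Netflix data).
actual = [4, 3, 5, 2, 4]
baseline = [3, 3, 4, 3, 3]   # hypothetical existing system
improved = [4, 3, 4, 2, 3]   # hypothetical new algorithm

if __name__ == "__main__":
    gain = relative_improvement(rmse(baseline, actual), rmse(improved, actual))
    print(f"RMSE reduced by {gain:.1%}; prize threshold met: {gain >= 0.10}")
```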

A small competition can also be useful

  • Javier Orlandi (Barcelona).

Cambridge developments

Alan Turing Institute

http://www.turing.ac.uk

  • Government investment of £42M. Cambridge, Oxford, UCL, Warwick, Edinburgh (5 x £5M).

  • Intel, Lloyds, GCHQ have already signed strategic partnerships.

  • Hub in British Library, launched Nov 2015.

Alan Turing Institute mission

  1. Fundamental algorithm research

  2. Training (PhD/postdoc scheme)

  3. Translation of ideas into practice

  4. Collaborations with companies, universities, public bodies and charities

Scoping workshops included big data in biology and health as key areas of interest.

Showcase event

http://www.bigdata.cam.ac.uk

University of Cambridge Mathematics and Big Data Showcase

Wednesday 20th April 2016

Centre for Mathematical Sciences, Cambridge.

http://www.turing-gateway.cam.ac.uk/mbd_apr2016.shtml

Subjects of talks will include industrial areas such as materials
and chemical decontamination, mathematical biology, financial
maths, cosmology, communications and social sciences. The Showcase
presents an excellent opportunity to bring together scientists
from mathematics and other disciplines such as physics, chemistry,
engineering etc, with interested parties from industry, government
and public sectors.